Migrating MATLAB to ZPL


Abstract

The task of migrating MATLAB programs to ZPL so that the computations can run on parallel platforms
and achieve significant performance improvements entails three tasks: upgrade ZPL to support sparse
representations, provide an interface to a parallel scientific library, and provide a mechanism by which
programmers can know when their MATLAB programs have limited parallelism. The project achieved
these three goals, although the original proposal’s plan to solve the latter problem on the MATLAB side
was replaced by a more effective solution that solves it on the "ZPL side." The software is fully
implemented and available free of charge over the WWW. There is a small ZPL user community. In
benchmark tests, ZPL programs are shown to perform as well as or better than programs written by experts
using C and message passing. ZPL programs are fully portable, running well on any UNIX platform. And
the language is convenient, automatically producing all concurrency, all communication and very
aggressive scalar optimizations. Additionally, a substantial amount of research was conducted on parallel
languages, parallel compilation and compiler optimizations. This research produced two dozen technical
papers and four PhD dissertations.

Generally, MATLAB is a forgiving, powerful and slow (to execute) means of expressing scientific
computation. Generally, ZPL is an exacting, powerful and fast means of expressing scientific
computations. The forgiving vs exacting and the slow vs fast tradeoffs are embodied in the differences
between an interpreted and a compiled language. The goal of this project was to migrate
from the forgiving to the exacting in order to replace the slow with the fast. That is, to have the
convenience and the speed too.

There are three fundamental challenges to converting MATLAB programs to ZPL. The first is
the ability to execute the MATLAB language constructs efficiently. Since the effort began with ZPL
already performing most of the MATLAB operations faster, in parallel and with [...]
MATLAB itself, the main challenge was MATLAB’s support for sparse arrays. MATLAB was the first,
and when this project started the only, language supporting sparsity. Its sparse matrices were largely
transparent to the user, so the performance advantages were limited. The challenge of providing a
language-level sparse array capability that was both parallel and high performance had never been met. This
project has accomplished this goal, as explained below. The solution not only provides sparse
arrays as a fundamental data type of ZPL, but the technology goes well beyond
MATLAB and is general enough to apply to any language, parallel or sequential. Thus, ZPL covers
MATLAB in the sense of running all of its abstractions fast and in parallel.

The second challenge is to provide a parallel interface to library routines, since most of MATLAB’s value
stems from its convenient interface to powerful scientific software. The problem is
that parallel scientific software is an enormous research area in its own right. We interacted with two of the
most well known scientific software groups: Jack Dongarra’s ScaLAPACK group and the
PLAPACK group. Though ZPL can interface to both, and we have worked out the details for both, we
demonstrated our solution using PLAPACK. It is the work of a week for a ScaLAPACK expert to
interface to that library.

The third challenge concerns the fact that the MATLAB language is a sequential language, but to run fast
enough for serious scientific computations, it must be run in parallel. Being sequential means that
whenever programmers are not using the scientific software, i.e. when they are programming directly in the
language, they are writing code that may or may not have efficient parallel execution. This problem was
described in depth in the proposal, and it was noted there that it couldn’t be solved. That is, to solve it is
tantamount to claiming a "general purpose automatic parallelization" technology, which has been promised
by many researchers for decades and not achieved. We believe that general automatic parallelization is not
a realistic goal. So, the plan in the proposal was to create a programmer’s aid that would identify those
places in the converted program that were not parallel. Since it became obvious early on, when NSF failed
to provide the funding to match DARPA’s, that such an ambitious software project was not feasible, we
developed an alternative as part of our best effort. We developed an abstraction called ZPL’s
WYSIWYG performance model [2], which gives programmers the information that an analyzer
would normally have. In a sense this solution is superior to the "programmer’s aid" because the
information can be used both for creating ZPL from MATLAB and for writing ZPL programs from
scratch. The latter would have been impossible without a "ZPL side" solution.

So, the technical goals of the project - support MATLAB’s operations, support scientific libraries and
handle the sequential nature of MATLAB program text - have been achieved. In addition, the
capability of ZPL has been substantially extended, including:

•  sparse  regions 

•  Mscan 

•  problem  space  promotion 

•  advanced  optimizations 

These  will  be  discussed  below.  Further,  the  project  supported  a  dedicated  cadre  of  users  in  applying  ZPL 
to  scientific  problems,  and  received  considerable  feedback  regarding  practical  applications.  At  the 
completion  of  this  research,  ZPL  is  a  freely  distributed  parallel  programming  language  capable  of  hosting 
MATLAB  programs  and  running  them  in  parallel  for  dramatic  speed  improvements.  Interestingly,  some 
MATLAB programmers have said that rather than converting, they’ll take the opportunity to develop a new
program  directly  in  ZPL. 

The  remainder  of  this  report  gives  technical  substance  to  the  topics  raised  in  the  Overview. 


Sparse  Regions  and  Arrays 

During  the  1990s  the  state-of-the-art  in  parallel  algorithms  improved  dramatically,  going  from  the  naive 
"dense"  solutions  so  common  previously  to  solutions  involving  much  more  sophisticated  data  structures, 
especially  sparse  arrays.  Languages  like  Fortran  90/95  and  High  Performance  Fortran  require  programmers 
to  implement  sparse  structures  manually.  This  is  not  only  very  difficult  work  for  programmers,  but  the 
compiler  is  unable  to  determine  what  the  program  is  actually  doing,  and  so  cannot  perform  sophisticated 
optimizations. MATLAB sought to help the programmer by constructing a "black box" sparse array that
the programmer could declare but whose use the programmer could neither observe nor influence. ZPL,
through this award, has created the first language-level abstraction for sparse arrays, implemented it, shown
how to compile it to run fast in parallel and demonstrated it on sparse benchmarks. This is a significant and
fundamental accomplishment.

The key insight required to introduce sparse arrays into ZPL is to recognize that dense arrays are defined
and transformed using dense regions [4]. Therefore, extending this notion to sparse arrays only requires
the invention of sparse regions. Regions are index sets, and a powerful new idea in ZPL. For the dense
index  case,  i.e.  those  common  cases  such  as  n  x  n  arrays,  regions  are  specified  by  giving  their  index  range, 

as  in 


region R = [1..n, 1..n]

which specifies the set of n^2 indices from (1,1) through (n,n). Though this looks like an array definition in
another language, it declares only the indices. The n x n arrays A, B and C could be declared from this
region by

var A, B, C : [R] double;

which specifies that each array has n^2 elements and the elements are double precision floating point
numbers.
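The separation between a region (indices only) and the arrays declared over it can be illustrated outside ZPL. The following is a minimal Python sketch, not ZPL and not how the ZPL compiler stores anything; the helper names `dense_region` and `declare_array` are hypothetical and exist only for this illustration.

```python
from itertools import product

def dense_region(n):
    """Return the index set {(1,1), ..., (n,n)} as a list of tuples (no data)."""
    return [(i, j) for i, j in product(range(1, n + 1), repeat=2)]

def declare_array(region, fill=0.0):
    """Allocate one double-precision value per index in the region."""
    return {index: float(fill) for index in region}

n = 4
R = dense_region(n)          # plays the role of: region R = [1..n, 1..n]
A = declare_array(R)         # plays the role of: var A, B : [R] double
B = declare_array(R, fill=1.0)
```

Here `R` holds only indices, while `A` and `B` each hold n^2 values keyed by those indices, which mirrors the distinction drawn in the text.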

The  sparse  case  is  considerably  more  complicated  [10].  First,  there  is  the  representation,  which  in  the  dense 
case  requires  only  the  lower  and  upper  bounds,  the  stride  and  the  starting  position.  In  the  sparse  case  a  full 
data  structure  must  be  created  to  keep  track  of  each  represented  item  in  the  sparse  structure  (known 
commonly  as  a  nonzero).  Further,  the  structure  must  support  all  of  the  ZPL  data  traversals.  This  structure 
requires  significant  memory  and  so  its  aggressive  optimization  is  essential  or  the  program  will  suffer 
adverse cache effects. The other problem with sparse arrays is that the represented elements must be
specified.  This  is  sometimes  static,  as  with  tridiagonal  matrices.  Most  commonly,  the  nonzeroes  are 
known  at  the  start  of  the  computation  and  can  either  be  computed  at  initialization  time  or  read  in  from  a 
file.  The  most  dynamic  case  is  when  the  configuration  of  the  nonzeroes  changes  incrementally  as  the 
computation  evolves.  The  current  implementation  handles  the  first  two  cases. 
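To make the contrast with the dense case concrete, here is a small Python sketch of a static sparse index set (the tridiagonal case mentioned above) together with one possible compressed row structure for it. This is an assumption for illustration only: the report does not describe ZPL's internal representation, and the function names are hypothetical.

```python
def tridiagonal_region(n):
    """Indices (i, j) with |i - j| <= 1 inside the 1..n x 1..n bounding region."""
    return [(i, j)
            for i in range(1, n + 1)
            for j in range(max(1, i - 1), min(n, i + 1) + 1)]

def to_row_index(region, n):
    """Compress a row-sorted index list into row offsets plus column indices."""
    row_start = [0] * (n + 1)
    cols = []
    for i, j in region:
        row_start[i] += 1          # count the indices in row i (rows are 1-based)
        cols.append(j)
    for i in range(1, n + 1):      # prefix sum turns counts into end offsets
        row_start[i] += row_start[i - 1]
    return row_start, cols

n = 4
S = tridiagonal_region(n)
row_start, cols = to_row_index(S, n)
```

A dense region needs only its bounds, whereas this structure stores one entry per represented index, so the columns of row i are found in `cols[row_start[i - 1]:row_start[i]]`.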

When measured on the NAS conjugate gradient benchmark, which has a programmer-produced sparse data
structure, the ZPL compiler performs remarkably well [11]. It is able to match both the footprint, i.e. the memory usage,
and the performance of a high quality parallel program. The source text for ZPL is trivial for the core
sparse matrix-vector multiplication, whereas it runs to pages for the hand-coded version because of all of
the communication.
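For readers unfamiliar with the kernel at the heart of the conjugate gradient benchmark, the following Python sketch shows a sequential sparse matrix-vector product over a row-compressed structure. It is neither the NAS code nor the ZPL source; the storage layout and the name `spmv` are assumptions, and the interprocessor communication that dominates the hand-coded parallel version is not shown.

```python
def spmv(row_start, cols, vals, x):
    """Compute y = A*x for A stored by rows; cols holds 0-based column indices."""
    y = [0.0] * (len(row_start) - 1)
    for i in range(len(y)):
        for k in range(row_start[i], row_start[i + 1]):
            y[i] += vals[k] * x[cols[k]]   # only represented elements are visited
    return y

# 3 x 3 tridiagonal matrix [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]
row_start = [0, 2, 5, 7]
cols = [0, 1, 0, 1, 2, 1, 2]
vals = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
y = spmv(row_start, cols, vals, [1.0, 1.0, 1.0])
```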

The sparse array work is the core of Bradford L. Chamberlain’s dissertation research [10], and has recently
appeared at an international conference [11]. It is too recent to see whether this will be incorporated into
other  programming  languages,  but  it  is  sufficiently  labor-saving  from  the  programmer’s  point  of  view  and 
sufficiently  effective  at  enabling  compiler  optimizations  that  it  is  likely  to  be  included  in  other  future 
systems. 

ZPL  Release 

Just  six  months  after  the  start  of  the  award  the  ZPL  compiler  was  publicly  released.  Of  course,  most  of  the 
development  was  supported  on  previous  awards,  but  the  present  award  assisted  in  the  distribution  and  user 
support, which was crucial to the feedback needed for the research. The free ZPL software, comprising a
compiler, libraries and documentation, was and remains the only high level parallel programming language
that can claim performance, portability and convenience [8]. "Performance" in this claim means that the
compiler produces, from the high level source, object code that runs as fast as a program written by an
experienced programmer in C with message passing, the present industry standard [9]. Recent comparisons
reveal that even experienced programmers cannot write code that runs as fast as ZPL, even for a sequential
computer. "Portability" means that ZPL runs well on any Unix/Linux platform, which includes
contemporary parallel machines as well as all sequential machines. It is a fundamental fact of computer
science (universality theorem) that any program can run on any computer, so the import of this remark is
the "runs well" claim. Expect a well-written ZPL program to run well on every platform. (A serious effort
was made to port ZPL to Microsoft’s NT, but the effort eventually failed as the operating system is very
difficult to work with where performance is concerned.) "Convenience" means that the programs are
simple  and  clear.  An  example  of  one  user’s  program  required  2.5  pages  to  solve  a  multigrid  combustion 
computation  in  ZPL  and  12.5  pages  in  C  with  MPI  message  passing  -  and  the  ZPL  program  ran  more  than 
twice  as  fast! 

ZPL’s  release  has  attracted  a  small,  but  dedicated  set  of  users.  These  users  have  not  only  made  the  compiler 
more  robust  by  testing  out  its  facilities,  but  they  have  provided  the  raw  material  for  both  the  language 
design  and  the  performance  studies. 

Mscan 

One of the most pioneering advancements in the present compiler is the creation of a high-level
programming abstraction for pipelining [6,7]. As is well known, pipelining is one of the most powerful and
widely used forms of parallelism. However, no high-level parallel language supported it directly despite
the fact that certain classic scientific computations, like solvers, must use pipelining to achieve any
performance at all.

One serious limitation with introducing pipelining into an array language, say for wavefront computations,
is  that  it  is  contrary  to  array  language  semantics.  To  accumulate  the  rows  of  an  array  one  might  wish  to 
write,  in  ZPL  style, 

A := A + A@north

which  seems  like  it  should  take  the  rows  of  array  A  and  replace  each  with  itself  and  the  row  above  it, 
leading  to  an  accumulating  sum.  However,  array  language  semantics  require  that  the  right-hand  side  be 
evaluated  entirely  before  the  assignment.  To  get  the  desired  wavefront  motion,  ZPL  introduces  the  prime 
operator,  so  the  correct  alternative  to  the  previous  statement  is 

A := A + A'@north

which  produces  an  accumulating  sum  and  pipelines  the  result  on  parallel  computers. 
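The difference between the two statements can be modeled in a few lines of sequential Python. This is only a sketch of the semantics described above, not ZPL and not the pipelined code the compiler generates; it assumes the statement is applied to every row except the first, and the function names are hypothetical.

```python
def add_north(a):
    """A := A + A@north over rows 2..n: the right-hand side sees only old values."""
    old = [row[:] for row in a]              # snapshot models "evaluate RHS first"
    return [old[0][:]] + [
        [old[i][j] + old[i - 1][j] for j in range(len(old[i]))]
        for i in range(1, len(old))
    ]

def add_north_primed(a):
    """A := A + A'@north over rows 2..n: each row reads the updated row above it."""
    out = [row[:] for row in a]
    for i in range(1, len(out)):             # rows must be visited top to bottom
        for j in range(len(out[i])):
            out[i][j] += out[i - 1][j]
    return out

a = [[1, 2], [3, 4], [5, 6]]
plain = add_north(a)
primed = add_north_primed(a)
```

With the unprimed form the last row becomes the sum of two old rows only, while the primed form carries the running total down each column; that row-to-row dependence is what makes the computation a wavefront.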

Though the prime operator is an example of applying commonly understood metaphors to achieve new
results, it doesn’t quite solve the problem, because most scientific computation is more complicated than a
single statement. For that reason, the mscan keyword was introduced to allow pipelining across a range of
statements. (mscan takes its name from "mighty scan", the term used in Ton Ngo’s thesis, where the idea
was invented [12].) The fundamental research to incorporate pipelining into ZPL and other languages was
the PhD dissertation of E Chris Lewis [13].


Advanced  Optimizations 

One  of  the  fundamental  rules  that  releasing  the  compiler  to  the  public  taught  the  ZPL  team  was  that  great 
parallel performance is useless unless great scalar performance is also achieved. That is, even if the
processors are working well together - and ZPL is outstanding at achieving that - the efficiency of the
computation on an individual processor is just as important if performance is to eclipse
programmer-produced code. For that reason the team has worked intensively at both parallel optimizations and
sequential  optimizations.  Most  of  this  work  is  published,  but  an  enumeration  of  it  here  is  useful. 


•  Communication Optimizations - the dissertation topic of Sung-Eun Choi [15] shows how ZPL can
optimize interprocessor communication to achieve better-than-message-passing performance [14]. A
key aspect of the approach is the Ironman communication abstraction. The bottom line result of this
dissertation is that well designed compilers are more effective than humans at inserting interprocessor
communication, raising the question "Why is message passing so popular?"

•  Fusion and Contraction - scalar language compilers create temporaries to hold intermediate results,
but when an array language does it there is a significant impact on storage. Removing this problem
was an important goal of the project because the temporaries ruin cache performance, a key advantage
of parallel machines that should not be lost. The net result is that an aggressive compiler (ZPL) can
remove not only the temporaries introduced by the compiler but also those introduced by the
programmer [3,13].

•  Collective Communication Optimizations - parallel computations require such things as global
sums, known as "collective" operations. The communication patterns for these are quite different from
those for other operations, so it makes sense to try optimizing them. This was the task of Derrick
Weathersby’s dissertation [16], which showed that combining and pipelining were powerful techniques
to reduce the wait times and overheads for communication in collective operations.
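The fusion and contraction item in the list above can be illustrated with a small Python sketch. It shows the general idea only, under the assumption of a simple elementwise expression D = (A + B) * C; it is not the ZPL compiler's algorithm, and the function names are hypothetical.

```python
def unfused(a, b, c):
    """Array-at-a-time evaluation: T is a full-size temporary array."""
    t = [x + y for x, y in zip(a, b)]        # T := A + B
    return [x * y for x, y in zip(t, c)]     # D := T * C

def fused_and_contracted(a, b, c):
    """One fused loop; the temporary array is contracted to a scalar."""
    d = []
    for x, y, z in zip(a, b, c):
        t = x + y                            # scalar stands in for T[i]
        d.append(t * z)
    return d

a, b, c = [1, 2, 3], [4, 5, 6], [2, 2, 2]
d_unfused = unfused(a, b, c)
d_fused = fused_and_contracted(a, b, c)
```

Both versions compute the same result, but the second touches each element once and never allocates the array-sized temporary, which is the storage and cache benefit the text describes.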


Other less grandiose optimizations have been incorporated in the compiler, though they have not led to
dissertation research.

Problem Space Promotion

Part  of  the  challenge  in  parallel  language  and  compiler  design  is  to  determine  how  a  problem  should  be 
solved  in  parallel  in  the  first  place.  Once  this  is  known  then  the  concepts  can  be  incorporated  into  the 
language  and  the  compiler  can  be  designed  to  produce  the  code.  One  technique  is  Problem  Space 
Promotion.  The  idea  is  that  computations  are  usually  solved  in  the  "dimensionality"  in  which  they  are 
represented,  e.g.  matrix  multiplication  is  solved  in  2-dimensions  because  matrices  are  2-dimensional.  But, 
it is often possible to specify what is to be computed by raising the dimensionality of the solution and
thereby  avoid  over-specifying  how  it  is  to  be  computed.  Without  over-specifying  the  compiler  has  more 
latitude  to  create  an  efficient  solution.  So,  matrix  multiplication  can  be  solved  in  3-dimensions  by  thinking 
of  each  operand  array  as  being  replicated  n  times  (n  is  the  common  dimension),  the  corresponding  elements 
multiplied  elementwise  and  then  the  dimensionality  reduced  by  summing  along  the  common  dimension. 

The  result  is  a  specification  of  matrix  product  with  only  computationally  required  dependencies  given. 

PSP  opportunities  arise  repeatedly  [5]. 

The idea of PSP computations seems clear, but with so much latitude, it is complicated for the compiler to
figure out how best to solve the promoted problem. The team took on as the goal to do as well as an expert
programmer, which amounts to avoiding the creation of higher dimensional intermediate arrays. If this
happened it could often overflow memory, since, for example, multiplying 1000x1000 arrays of doubles
requires an intermediate array of 8GB. But, even when it doesn't overflow the memory, it will surely
overflow the cache, an equally bad outcome. The ZPL compiler generates code for "problem space
promoted" computations that achieves both efficient intermediate memory usage and high
performance [3].
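The promoted formulation of matrix multiplication, and the reason the intermediate array must be avoided, can be sketched in Python as follows. This illustrates the idea only and is not the code ZPL generates; the function names are hypothetical, and the byte count simply restates the 1000 x 1000 example as 1000^3 elements of 8 bytes each.

```python
def matmul_promoted(a, b):
    """Build the full 3-D product P[i][j][k] = A[i][k] * B[k][j], then reduce over k."""
    n, m, p = len(a), len(b), len(b[0])
    promoted = [[[a[i][k] * b[k][j] for k in range(m)]
                 for j in range(p)] for i in range(n)]
    return [[sum(promoted[i][j]) for j in range(p)] for i in range(n)]

def matmul_contracted(a, b):
    """Same result, with the reduction folded in so no 3-D array is stored."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_promoted = matmul_promoted(a, b)
c_contracted = matmul_contracted(a, b)

# Size of the promoted intermediate for 1000 x 1000 operands of doubles.
intermediate_bytes = 1000 ** 3 * 8
```

The first function states what is computed with no ordering constraints beyond the final reduction; the second is one of many schedules a compiler could choose that produces the same values without materializing the three-dimensional intermediate.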

What  You  See  Is  What  You  Get 

The  development  of  the  WYSIWYG  model  of  parallelism  turned  out  to  be  critical  to  enabling  MATLAB 
programmers  to  know  when  their  corresponding  ZPL  programs  would  have  limited  parallelism.  But,  the 
original purpose of WYSIWYG [2] was to write good parallel programs from scratch. This is a pioneering
idea,  and  it  works  like  this. 

When  programmers  write  in  C  or  Fortran  they  believe  they  "know"  what  the  generated  code  will  look  like. 
In  actuality,  they  are  often  surprised  because  aggressive  compilers  often  transform  source  code 
tremendously,  but  that’s  not  the  issue.  The  point  is  that  programmers  "know"  because  there  is  a  standard 
model  of  sequential  computers  (von  Neumann)  and  the  model  tells  them  how  efficiently  their  program  will 
run.  (The  compiler’s  transformations  are  improving  on  this,  so  their  understanding  is  the  worst-case 
performance.)  In  the  parallel  world  only  ZPL  has  adopted  a  standard  model,  the  CTA  model.  In  the  same 
way  that  the  von  Neumann  model  tells  Fortran  programmers  how  their  code  will  run,  unless  the  compiler 
will  do  better,  the  CTA  tells  ZPL  programmers  how  well  their  program  will  run,  unless  the  compiler  can  do 
better.  The  CTA  concentrates  on  those  features  like  interprocessor  communication  and  latency  that  are 
peculiar  to  parallel  computers,  leaving  the  details  of  the  scalar  processor  to  the  von  Neumann  model. 

Though  this  appears  to  be  an  amazingly  obvious  requirement  for  parallel  programming  success,  it  is  not  a 
property  of  any  other  parallel  programming  language.  Further,  it  cannot  be  a  property  of  any  programming 
approach  based  on  message  passing.  To  note  how  well  it  works,  the  project  members  took  two  standard 
matrix  multiplication  problems  and  wrote  them  in  ZPL.  The  programs  were  quite  different,  of  course,  but 
using the WYSIWYG model, it was possible to do a back-of-the-envelope analysis of which program would
run faster. A MATLAB programmer migrating to ZPL would do the same kind of analysis. Once the analysis was completed, a series of experiments across a
series of parallel programs showed that the WYSIWYG performance prediction was, indeed, true [2]. As
always, the ability to correctly predict an outcome is the hallmark of quality science.

Summary 

As a result of this award it is now possible to migrate programs from MATLAB to ZPL. If the programs
use the sparse features of MATLAB, then the sparse features of ZPL will be used. In addition to the basic
goals of the project, a large body of associated and related scientific research was also produced. Four
graduate students wrote doctoral dissertations under its auspices. All of the features described in this report are
implemented and are available free of charge to the community. A small cadre of programmers uses ZPL
routinely.